The “Groceries” dataset contains transactions, each listing the items bought in a store by a customer, collected over a period of time. As business analysts, we hope to identify trends in customer purchase behavior by analyzing this basket data.
Association Rule Mining
Association Rule Mining is used to find associations between objects in a set, or frequent patterns in a transaction database, relational database or any other information repository. Its applications include marketing, basket data analysis (or market basket analysis) in retailing, clustering and classification. It can tell you which items customers frequently buy together by generating a set of rules called association rules. In simple words, its output is a set of rules of the form “if this, then that”.
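To make the “if this, then that” form concrete, here is a minimal base-R sketch on five made-up baskets. The baskets, the helper support_of() and the rule {bread} => {butter} are illustrative assumptions, not part of the Groceries data or the arules package. It computes the three measures that are reported for every rule later on: support, confidence and lift.

```r
# Toy baskets (hypothetical, for illustration only)
baskets <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "jam"),
  c("milk", "butter"),
  c("bread", "butter", "jam")
)

# Fraction of baskets that contain every item in `items`
support_of <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- "bread"   # the "if this" part
rhs <- "butter"  # the "then that" part

rule_support    <- support_of(c(lhs, rhs), baskets)            # share of baskets with both
rule_confidence <- rule_support / support_of(lhs, baskets)     # share of lhs baskets that also have rhs
rule_lift       <- rule_confidence / support_of(rhs, baskets)  # confidence relative to the baseline share of rhs

c(support = rule_support, confidence = rule_confidence, lift = rule_lift)
```

A lift above 1 suggests the two sides occur together more often than their individual frequencies would predict; a lift below 1 suggests the opposite.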
The functions in the arules package only accept transaction data. Hence the flat file needs to be processed so that the input to the functions is in the expected form.
library(arules)
retail = scan('C:/Users/namit/Desktop/MSBA/Intro to Predictive Analytics/STA380-master/data/groceries.txt',what="", sep='\n')
head(retail)
## [1] "citrus fruit,semi-finished bread,margarine,ready soups"
## [2] "tropical fruit,yogurt,coffee"
## [3] "whole milk"
## [4] "pip fruit,yogurt,cream cheese ,meat spreads"
## [5] "other vegetables,whole milk,condensed milk,long life bakery product"
## [6] "whole milk,butter,yogurt,rice,abrasive cleaner"
str(retail)
## chr [1:9835] "citrus fruit,semi-finished bread,margarine,ready soups" ...
summary(retail)
## Length Class Mode
## 9835 character character
groceries = strsplit(retail,",")
groctrans=as(groceries, "transactions")
summary(groctrans)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
## whole milk other vegetables rolls/buns soda
## 2513 1903 1809 1715
## yogurt (Other)
## 1372 34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55
## 16 17 18 19 20 21 22 23 24 26 27 28 29 32
## 46 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
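The density and mean basket size in this summary can be cross-checked by hand. The sketch below is plain arithmetic on numbers copied from the output above (the variable names are my own): density is the share of non-empty cells in the 9835 x 169 transaction matrix.

```r
# Numbers copied from the summary(groctrans) output
n_transactions <- 9835
n_items        <- 169
item_counts    <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top five items plus "(Other)"

total_purchases <- sum(item_counts)                         # non-empty cells in the matrix
density   <- total_purchases / (n_transactions * n_items)   # share of cells that are filled
mean_size <- total_purchases / n_transactions               # average number of items per basket

round(density, 8)
round(mean_size, 3)
```

Both values should agree with the density and the mean of the length distribution reported by summary().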
Next, we want to understand which items are purchased most frequently. For this, we use itemFrequencyPlot() with type ‘absolute’ and ‘relative’. With ‘absolute’, it plots the number of transactions that contain each item. With ‘relative’, it plots that number as a proportion of all transactions.
# Create an item frequency plot for the top 20 items
if (!require("RColorBrewer")) {
# install color package of R
install.packages("RColorBrewer")
#include library RColorBrewer
library(RColorBrewer)
}
## Loading required package: RColorBrewer
itemFrequencyPlot(groctrans,topN=20,type="absolute",col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")
itemFrequencyPlot(groctrans,topN=20,type="relative",col=brewer.pal(8,'Pastel2'),main="Relative Item Frequency Plot")
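The two plot types differ only by a constant factor. As a small sketch, the counts from summary(groctrans) can be converted into relative frequencies by dividing by the number of transactions (the variable names here are my own):

```r
# Counts taken from the summary(groctrans) output; 9835 is the number of transactions
absolute <- c("whole milk" = 2513, "other vegetables" = 1903,
              "rolls/buns" = 1809, "soda" = 1715, "yogurt" = 1372)
n_transactions <- 9835

# Relative frequency = share of transactions that contain the item
relative <- absolute / n_transactions
round(relative, 3)
```

The ordering of the bars is therefore the same in both plots; only the scale of the y-axis changes.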
From the graphs above, it is evident that the top 5 items being purchased are:
1. whole milk
2. other vegetables
3. rolls/buns
4. soda
5. yogurt
(Basic survival kit!)
The next step is generating rules using the Apriori algorithm! We can mine rules using the apriori() function from the arules package.
apriori() takes ‘groctrans’ as the transaction object to mine. Its parameter argument lets you set the minimum support and minimum confidence. Here we use a minimum support of 0.001, a minimum confidence of 0.8 and a maximum of 10 items per rule (maxlen). Of these, only the support differs from the package default, which is 0.1.
We arrived at these values by trial and error. A lower confidence threshold produces more rules, but we wanted to be confident in the predictions we make from them.
association.rules <- apriori(groctrans, parameter = list(supp=0.001, conf=0.8,maxlen=10))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(association.rules)
## set of 410 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 29 229 140 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.329 5.000 6.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.001017 Min. :0.8000 Min. : 3.131 Min. :10.00
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.: 3.312 1st Qu.:10.00
## Median :0.001220 Median :0.8462 Median : 3.588 Median :12.00
## Mean :0.001247 Mean :0.8663 Mean : 3.951 Mean :12.27
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.: 4.341 3rd Qu.:13.00
## Max. :0.003152 Max. :1.0000 Max. :11.235 Max. :31.00
##
## mining info:
## data ntransactions support confidence
## groctrans 9835 0.001 0.8
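The minimum count of 10 in the quality measures follows from the support threshold. A short worked calculation, using only numbers from the output above:

```r
n_transactions <- 9835
min_support    <- 0.001

# Smallest whole number of transactions whose share reaches the support threshold
min_count <- ceiling(min_support * n_transactions)

# Support of a rule that occurs in exactly `min_count` transactions
min_count / n_transactions
```

The product 0.001 * 9835 is 9.835, so a rule needs at least 10 transactions to qualify. The “Absolute minimum support count: 9” line in the log is consistent with that product being truncated to a whole number.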
inspect(association.rules[1:10])
## lhs rhs support confidence lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 11.235269 19
## [2] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [3] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 3.261374 10
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 3.587512 11
## [6] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 3.179840 13
## [7] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [9] {curd,
## turkey} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [10] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 3.913649 12
From the above output, we can see that every transaction containing rice and sugar also contained whole milk (confidence of 1.0), although this rule rests on only 12 transactions.
The other rules shown also have high confidence (at least 0.8, the threshold we set).
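The quality measures for rule [10] can be reproduced from numbers already shown. This sketch uses the rule's count and confidence from the inspect() output and the whole milk count from summary(groctrans); the variable names are my own.

```r
n_transactions   <- 9835
count_rule       <- 12    # transactions containing rice, sugar and whole milk (rule [10])
count_whole_milk <- 2513  # transactions containing whole milk
confidence       <- 1.0   # reported for rule [10]

support <- count_rule / n_transactions                        # share of all transactions
lift    <- confidence / (count_whole_milk / n_transactions)   # confidence vs. baseline share of whole milk

round(c(support = support, lift = lift), 6)
```

A lift of about 3.9 means that baskets with rice and sugar contain whole milk roughly 3.9 times as often as an average basket does.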
You can also remove redundant rules by creating a subset of unique rules.
# Flag rules whose items include all the items of another rule
subset.rules <- which(colSums(is.subset(association.rules, association.rules)) > 1)
length(subset.rules) # 91 of the 410 rules are flagged
## [1] 91
subset.association.rules <- association.rules[-subset.rules] # drop the flagged rules
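The colSums(...) > 1 test can be hard to read at first. The following base-R sketch mimics the idea on three hypothetical rules, each written as the full set of items it involves. It is my own illustration of the logic, not the arules implementation.

```r
# Hypothetical rules, each given as all items it involves (lhs and rhs together)
rule_items <- list(
  r1 = c("rice", "sugar", "whole milk"),
  r2 = c("rice", "sugar", "root vegetables", "whole milk"),
  r3 = c("curd", "turkey", "other vegetables")
)

# is_sub[i, j] is TRUE when every item of rule i also appears in rule j
is_sub <- sapply(rule_items, function(y) {
  sapply(rule_items, function(x) all(x %in% y))
})

# Every rule is a subset of itself, so a column sum above 1 means
# rule j contains all the items of at least one other rule
flagged <- which(colSums(is_sub) > 1)
kept    <- rule_items[-flagged]

names(flagged)
names(kept)
```

Here only r2 is flagged, because it contains every item of r1. Note that negative indexing with an empty index vector drops everything, so it is worth checking that at least one rule was flagged before subsetting this way.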
A straightforward way to visualize association rules is a scatter plot, drawn with plot() from the arulesViz package. It puts support and confidence on the axes. In addition, a third measure, lift, is used by default to set the color (grey levels) of the points.
library(arulesViz)
# All 410 rules already have confidence >= 0.8, so this filter keeps every rule
subRules<-association.rules[quality(association.rules)$confidence>0.4]
#Plot SubRules
plot(subRules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
Rules with high lift tend to have low support.
plot(subRules,method="two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The two-key plot uses support and confidence on the x- and y-axes respectively, and colors each point by its order, that is, the number of items in the rule.
Here is a parallel coordinates plot for the 20 rules with the highest lift. For example, the arrow in red denotes that customers who purchase liquor and red/blush wine are also likely to buy bottled beer.
# Filter top 20 rules with highest lift
subRules2<-head(subRules, n=20, by="lift")
plot(subRules2, method="paracoord")
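Selecting the top rules by a quality measure amounts to sorting in decreasing order and keeping the first n rows. As a sketch of that idea in base R, using four of the rules and lift values from the inspect() output (the data frame itself is my own construction):

```r
# Four rules and their lift values, copied from the inspect() output
rules <- data.frame(
  rule = c("{liquor, red/blush wine} => {bottled beer}",
           "{cereals, curd} => {whole milk}",
           "{curd, turkey} => {other vegetables}",
           "{rice, sugar} => {whole milk}"),
  lift = c(11.235269, 3.557863, 4.134524, 3.913649),
  stringsAsFactors = FALSE
)

# Sort by lift, highest first, and keep the top two
top2 <- head(rules[order(rules$lift, decreasing = TRUE), ], n = 2)
top2
```

The liquor and red/blush wine rule comes first, which is why it stands out in the plot.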
Given below are a few interactive visualizations for exploring the rules and their items in more detail.
# plotly_arules() is deprecated; plot() with the plotly engine replaces it
plot(subRules, engine = "plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
top10subRules <- head(subRules, n = 10, by = "confidence")
plot(top10subRules, method = "graph", engine = "htmlwidget")